Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

Copy link
Contributor Author

MasterPtato commented Jan 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 9, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3798

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3798

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3798

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3798

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3798

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3798

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3798

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3798

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3798

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3798

commit: 821194e

@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 865be47 to 270fad1 Compare January 10, 2026 02:00
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 1ba3107 to 22e1cee Compare January 10, 2026 02:00
@vercel
Copy link

vercel bot commented Jan 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivet-cloud Ready Ready Preview, Comment Jan 10, 2026 2:02am
rivetkit-serverless Error Error Jan 10, 2026 2:02am
2 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-inspector Ignored Ignored Preview Jan 10, 2026 2:02am
rivet-site Ignored Ignored Preview Jan 10, 2026 2:02am

@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 270fad1 to d77de78 Compare January 12, 2026 19:08
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 22e1cee to 2701da3 Compare January 12, 2026 19:08
@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from d77de78 to 4ce772a Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 2701da3 to 7057252 Compare January 13, 2026 00:44

tx.write(&keys::signal::AckTsKey::new(key.signal_id), now)?;

update_metric(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on publish signal:

  • incr global count
  • incr per-workflow count

on workflow complete:

  • decr global count by per-workflow count
  • do not ack signals themselves

@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 4ce772a to d7548da Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 7057252 to 373f6cf Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from d7548da to a129485 Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 373f6cf to ff1afac Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from a129485 to 45ec780 Compare January 14, 2026 19:47
Copy link
Member

@NathanFlurry NathanFlurry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could benefit from some comments explaining what's going on, it's really really hard to follow which keys do what

Copy link
Member

@NathanFlurry NathanFlurry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from ff1afac to 5d2fb4a Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch 2 times, most recently from 8041bfa to 01b117b Compare January 14, 2026 23:07
@claude
Copy link

claude bot commented Jan 14, 2026

PR Review: Fix Clear All Pending Signals After Workflow Complete

Summary

This PR fixes an issue where pending signals weren't being properly cleared when workflows completed. The fix introduces per-workflow gauge metrics to track pending signals and ensures they're decremented from global metrics during workflow completion.

Key Changes

1. New Per-Workflow Metrics System (keys/workflow.rs)

Added a new GaugeMetricKey structure to track metrics at the workflow level:

  • Mirrors the global keys::metric::GaugeMetric structure
  • Currently only tracks SignalPending metrics per workflow
  • Uses proper tuple packing/unpacking for database serialization

2. Global Metric Migration (keys/metric.rs)

3. Workflow Completion Logic (db/kv/mod.rs)

Enhanced complete_workflow to:

  • Read all per-workflow pending signal metrics
  • Decrement global metrics by the accumulated counts
  • Log when signals are cleared for debugging
  • Return count of cleared signals

4. Dual Metric Tracking

Both dispatch_signal and signal acknowledgment now update:

  • Global metrics (keys::metric::GaugeMetric::SignalPending2)
  • Per-workflow metrics (keys::workflow::GaugeMetric::SignalPending)

Code Quality Assessment

Strengths

  1. Well-structured solution: The dual-tracking approach (global + per-workflow) provides both observability and correctness
  2. Proper cleanup: Workflow completion now properly accounts for all pending signals
  3. Backward compatibility: Deprecation markers instead of breaking changes
  4. Good observability: Added debug logging for signal clearing
  5. Consistent patterns: Follows existing metric update patterns

⚠️ Issues & Concerns

1. Type Safety Issue in Metric Count Handling (CRITICAL)

Location: db/kv/mod.rs:1868

// Ignore negatives and zero
if isize::from_le_bytes(metric_count.to_le_bytes()) <= 0 {
    continue;
}

Problem: This performs a round-trip conversion (usize → bytes → isize) which can produce incorrect results:

  • A usize value larger than isize::MAX will be interpreted as negative
  • This is semantically unclear and potentially fragile

Recommendation:

// Ignore zero and skip processing (negative values shouldn't exist)
if metric_count == 0 {
    continue;
}

If negative values are truly expected in the database, there's a deeper issue with the data model since GaugeMetricKey::Value is defined as usize.


2. Data Duplication and Consistency Risk (HIGH)

Location: Multiple locations in db/kv/mod.rs

Problem: Pending signals are now tracked in TWO places:

  • Global: keys::metric::GaugeMetric::SignalPending2
  • Per-workflow: keys::workflow::GaugeMetric::SignalPending

This creates several risks:

  1. Race conditions: The two metrics are updated in separate atomic operations (lines 227-236)
  2. Consistency drift: If one update succeeds and another fails, metrics become inconsistent
  3. Performance overhead: Doubling the number of database operations for every signal

Current Implementation:

// Global metric update
update_metric(
    &tx.with_subspace(self.subspace.clone()),
    None,
    Some(keys::metric::GaugeMetric::SignalPending2(signal_name.to_string())),
);

// Per-workflow metric update (separate operation)
update_wf_metric(
    &tx.with_subspace(self.subspace.clone()),
    workflow_id,
    None,
    Some(keys::workflow::GaugeMetric::SignalPending(signal_name.to_string())),
);

Recommendations:

  1. Short-term: Document that both metrics must always be updated together
  2. Medium-term: Consider batching these operations or using a transaction guard
  3. Long-term: Evaluate if global metrics can be derived from per-workflow metrics via aggregation queries instead of maintaining duplicates

3. Migration Path Unclear (MEDIUM)

Location: keys/metric.rs

Problem: The old SignalPending (variant #4) is marked deprecated, but:

  • No migration code to convert old metrics to new format
  • No timeline for removing deprecated variant
  • Metrics exposition at line 746 ignores old variant entirely:
    keys::metric::GaugeMetric::SignalPending(_) => {}  // Silent no-op
    keys::metric::GaugeMetric::SignalPending2(signal_name) => { /* ... */ }

Questions:

  • Are there existing workflows with SignalPending (v1) metrics in the database?
  • Will they be orphaned when workflows complete?
  • Should there be a background migration job?

Recommendation: Add a comment explaining the migration strategy or confirm that this is a new deployment without legacy data.


4. Resource Cleanup Could Be Optimized (LOW)

Location: db/kv/mod.rs:1846-1883

Current approach:

// Stream all metrics for this workflow
let mut stream = tx.get_ranges_keyvalues(/* ... */);
loop {
    let Some(entry) = stream.try_next().await? else { break; };
    // Process each metric
}

Observation: This streams ALL gauge metrics for a workflow, but currently only SignalPending exists. As more metric types are added, this could become inefficient.

Recommendation: Consider more targeted cleanup (e.g., specific key range for SignalPending) or document that this pattern is intentional for extensibility.


5. Inconsistent String Allocation (LOW)

Location: db/kv/mod.rs:2191, 2199

// Line 2191 - clone()
key.signal_name.clone(),

// Line 227 - to_string()
signal_name.to_string(),

Issue: Mixing .clone() and .to_string() for the same type (String) is inconsistent and may confuse readers.

Recommendation: Use .clone() consistently since signal_name is already a String.


6. Unrelated Changes in PR (LOW)

Location: Multiple files

Unrelated changes observed:

  1. pegboard/workflows/runner.rs:11 - Removed unused import
  2. pegboard/workflows/runner2.rs:11 - Removed unused import
  3. serverless/conn.rs:3 - Changed import style
  4. universalpubsub/driver/postgres/mod.rs:392 - Changed log level debug→trace
  5. workflow-worker/Cargo.toml:9-15 - Reformatted dependencies
  6. scripts/debug/decode_runner_config.js - New utility script

Recommendation: These changes should be in separate commits or PRs for cleaner history, especially the new JavaScript decoder script which appears completely unrelated.


Performance Considerations

Acceptable

  • Workflow completion is infrequent relative to signal dispatch
  • Streaming metrics during cleanup is reasonable for small counts
  • Atomic operations are efficient

⚠️ Watch Out For

  • High-frequency workflows: If workflows complete frequently with many pending signals, the metric cleanup loop (lines 1858-1883) could become a bottleneck
  • Metric accumulation: Each signal dispatch now does 2× database writes instead of 1×

Security Concerns

No Major Issues

  • No user input directly incorporated into database keys
  • Signal names are already controlled by the workflow system
  • Proper use of typed database operations

Test Coverage

Missing Tests

The PR adds 424 lines but includes no test changes. Critical scenarios needing tests:

  1. Workflow completion clears pending signals correctly

    • Workflow with N pending signals completes
    • Verify global metric decremented by N
    • Verify per-workflow metrics deleted
  2. Metric consistency

    • Signal dispatched → both metrics incremented
    • Signal acknowledged → both metrics decremented
    • Workflow completes → both metrics cleaned up
  3. Edge cases

    • Workflow completes with 0 pending signals
    • Workflow completes with very large pending signal count
    • Multiple signal types per workflow
  4. Migration scenario

    • Existing workflows with old SignalPending metrics
    • Verify they don't cause errors during completion

Recommendation: Add integration tests covering at least the happy path before merging.


Best Practices Alignment

Follows CLAUDE.md Guidelines

  • Uses workspace dependencies correctly
  • Proper error handling with Result
  • Structured logging with tracing
  • Lowercase log messages
  • Comments as complete sentences (mostly)

⚠️ Minor Deviations

  • Some comments could be more concise (e.g., line 1846 comment is quite long)
  • The deprecation comments could follow a more structured format

Recommendations Summary

Must Fix Before Merge (CRITICAL)

  1. Fix type safety issue at line 1868 - use direct usize comparison instead of round-trip conversion

Should Fix Before Merge (HIGH)

  1. Add test coverage for metric cleanup during workflow completion
  2. Document or resolve the dual-metric consistency risk
  3. Clarify migration path for deprecated SignalPending variant

Consider for Follow-up (MEDIUM-LOW)

  1. Split unrelated changes into separate commits/PRs
  2. Optimize metric cleanup if many metric types are planned
  3. Use consistent string allocation (.clone() vs .to_string())
  4. Add monitoring/alerting for metric drift between global and per-workflow counts

Verdict

The PR addresses a real bug and the overall approach is sound, but there are critical issues that should be addressed:

  1. Type safety bug in negative value handling
  2. Lack of test coverage for the core functionality
  3. Potential consistency issues with dual metric tracking

Recommendation: Request changes before merging.


Additional Questions for Author

  1. Have you tested this with a workflow that has many (e.g., 1000+) pending signals?
  2. What happens if the transaction fails after updating one metric but before updating the other?
  3. Is there a plan to migrate existing SignalPending (v1) metrics, or is this a greenfield deployment?
  4. Why was the JavaScript decoder script included in this PR?

@MasterPtato MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 01b117b to 821194e Compare January 14, 2026 23:39
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 5d2fb4a to 6415162 Compare January 14, 2026 23:39
@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

@graphite-app graphite-app bot closed this Jan 14, 2026
@graphite-app graphite-app bot deleted the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch January 14, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants